Abstract. We show that a simple algorithm for computing a matching
on a graph runs in a logarithmic number of phases, incurring work linear
in the input size. The algorithm can be adapted to provide efficient
algorithms in several models of computation, such as PRAM, External
Memory, MapReduce and distributed memory models. Our CREW
PRAM algorithm is the first $O(\log^2 n)$ time, linear work algorithm. Our
experimental results indicate the algorithm's high speed and efficiency
combined with good solution quality.



1 Introduction 

A matching M of a graph G = (V, E) is a subset of edges such that no two 
elements of M have a common end point. Many applications require the compu- 
tation of matchings with certain properties, like being maximal (no edge can be 
added to M without violating the matching property), having maximum
cardinality, or having maximum total weight $\sum_{e \in M} w(e)$. Although these problems
can be solved optimally in polynomial time, optimal algorithms are not fast 
enough for many applications involving large graphs where we need near linear 
time algorithms. For example, the most efficient algorithms for graph partition- 
ing rely on repeatedly contracting maximal matchings, often trying to maximize 
some edge rating function w. Refer to for details and examples. For very 
large graphs, even linear time is not enough: we need a parallel algorithm with
near linear work or an algorithm working in the external memory model [23].

Here we consider the following simple local max algorithm [T^ : Call an edge 
locally maximal, if its weight is larger than the weight of any of its incident 
edges; for unweighted problems, assign unit weights to the edges. When com- 
paring edges of equal weight, use tie breaking based on random perturbations of 
the edge weights. The algorithm starts with an empty matching M. It repeat- 
edly adds locally maximal edges to M and removes their incident edges until no 
edges are left in the graph. The result is obviously a maximal matching (every 
edge is either in M or it has been removed because it is incident to a matched 
edge). The algorithm falls into a family of weighted matching algorithms for 
which Preis [21] shows that they compute a 1/2-approximation of the maximum
weight matching problem. Hoepman derives the local max algorithm as a 
distributed adaptation of Preis' idea. Based on this, Manne and Bisseling [IB] 
devise sequential and parallel implementations. They prove that the algorithm 
needs only a logarithmic number of iterations to compute maximal matchings
by noticing that a maximal matching problem can be translated into a maximal
independent set problem on the line graph, which can be solved by Luby's
algorithm [15]. However, this does not yield an algorithm with linear work since it is
not proven that the edge set indeed shrinks geometrically.^1 Manne and Bisseling
also give a sequential algorithm running in time $O(m \log \Delta)$, where $\Delta$ is the
maximum degree. On a NUMA shared memory machine with 32 processors (SGI
Origin 3800) they get relative speedup < 6 for a complete graph and relative
speedup ≈ 10 for a sparser graph partitioned with Metis. Since this graph
still has average degree ≈ 200 and since the speedups are not impressive, this is a
somewhat inconclusive result when one is interested in partitioning large sparse
graphs on a larger number of processors.

Parallel matching algorithms have been widely studied. There is even a book
on the subject [14], but most theoretical results concentrate on work-inefficient
algorithms. The only linear work parallel algorithm that we are aware of is a
randomized CRCW PRAM algorithm by Israeli and Itai [12], which runs in $O(\log n)$
time and incurs linear work. Their algorithm, which we call IIM, provably
removes a constant fraction of edges in each iteration.

Fagginger Auer and Bisseling [6] study an algorithm similar to [12] which we
call red-blue matching (RBM) here. They implement RBM on shared memory
machines and GPUs. They prove good shrinking behavior for random graphs
but provide no analysis for arbitrary graphs.

Our contributions. We give a simple approach to implementing the local
max algorithm that is easy to adapt to many models of computation. We show
that for computing maximal matchings, the algorithm needs only linear work on
a sequential machine and in several models of parallel computation (Section 2).
Moreover, it has low I/O complexity on several models of memory hierarchies.

Our CRCW PRAM local max algorithm matches the optimal asymptotic
bounds of IIM. However, our algorithm is simpler (resulting in better constant
factors), removes a higher fraction of edges in each iteration (IIM's proof shows
less than 5% per iteration, while we show at least 50%) and our analysis is a
lot simpler. We also provide the first CREW PRAM algorithm which runs in
$O(\log^2 n)$ time and linear work.^2

In Section 3 we explain how to implement local max on practical massively
parallel machines such as MPI clusters and GPUs. Our experiments indicate
that the algorithm yields surprisingly good quality for the weighted matching
problem and runs very efficiently on sequential machines, clusters with
reasonably partitioned input graphs, and on GPUs. Compared to RBM, the local max
implementations remove more edges in each iteration and provide better quality
results for the weighted case. Some of the results presented here are from the
diploma thesis of Marcel Birn [5].



^1 Manne and Bisseling show such a shrinking property under an assumption that
unfortunately does not hold for all graphs.
^2 While a generic simulation of IIM on the CREW PRAM model will result in an
$O(\log^2 n)$ time algorithm, the simulation incurs $O(n \log n)$ work due to sorting.

2 Parallel Local Max



Our central observation is: 

Lemma 1. Each iteration of the local max algorithm for the unit weight case 
removes at least half of the edges in expectation. 

Proof. Consider the graph remaining in the currently considered iteration, where
$d(v)$ denotes the degree of a node $v$ and $m$ the remaining number of edges. Consider
the end point at node $v$ of an edge $\{u, v\}$ as marked if and only if some edge
incident to $v$ becomes matched. Note that an edge is removed if and only if at
least one of its end points becomes marked. Now consider a particular edge $e =
\{u, v\}$. Since any of the $d(u) + d(v) - 1$ edges incident to $u$ and $v$ is equally likely
to be locally maximal, $e$ becomes matched with probability $1/(d(u) + d(v) - 1)$.^3
If $e$ is matched, this event is responsible for setting $d(u) + d(v)$ marks, i.e., the
expected number of marks caused by an edge is $(d(u) + d(v))/(d(u) + d(v) - 1) > 1$.
By linearity of expectation, the total expected number of marks is at least $m$.
Since no edge can have more than two marks, at least $m/2$ edges have at least
one mark and are thus deleted.^4 ■

Assume now that each iteration can be implemented to run with work linear 
in the number of surviving edges (independent of the number of nodes). Working 
naively with the expectations, this gives us a logarithmic number of rounds and a 
geometric sum leading to linear total work for computing a maximal matching. 
This can be made rigorous by distinguishing good rounds with at least m/4
matched edges and bad rounds with fewer matched edges. By Markov's inequality,
we have a good round with constant probability. This is already sufficient to
show expected linear work and a logarithmic number of expected rounds. We
skip the details since this is a standard proof technique and since the resulting
constant factors are unrealistically conservative. An analogous calculation for
median selection can be found in [18, Theorem 5.8]. One could attempt to show
a shrinking factor close to 1/2 rigorously by showing that large deviations (in the
wrong direction) from the expectation are unlikely (e.g., using martingale tail
bounds). However, this would still be a factor two away from the more heuristic
argument in Footnote 4 and thus we stick to the simple argument.
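As a worked version of the naive expectation argument, assume (purely for illustration) that iteration $i$ starts with $m_i$ edges, that every iteration removes exactly the expected half of them, and that an iteration costs at most $c\,m_i$ operations for some constant $c$. The symbols $m_i$ and $c$ are introduced here and do not appear elsewhere in the paper.

```latex
m_i \le \frac{m}{2^{i}}, \qquad
\#\text{rounds} \le \lfloor \log_2 m \rfloor + 1, \qquad
\text{total work} \le \sum_{i \ge 0} c\, m_i
  \le c\, m \sum_{i \ge 0} 2^{-i} = 2\,c\,m = O(m).
```

The rigorous argument with good and bad rounds yields the same asymptotic bounds with larger constants.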

There are many ways to implement an iteration which of course depend on 
the considered model of computation. 

Sequential Model. For each node v maintain a candidate edge C[v], originally
initialized to a dummy edge with zero weight. In an iteration go through all
remaining edges e = {u, v} three times. In the first pass, if w(e) > w(C[u]) set
C[u] := e (add random perturbation to w(e) in case of a tie). If w(e) > w(C[v])
set C[v] := e. In the second pass, if C[u] = C[v] = e put e into the matching
M. In the third pass, if u or v is matched, remove e from the graph. Otherwise,
reset the candidate edge of u and v to the dummy edge. Note that except for
the initialization of C, which happens only once before the first iteration, this
algorithm has no component depending on the number of nodes and thus leads
to linear running time in total if Lemma 1 is applied.

^3 For this to be true, the random noise added for tie breaking needs to be renewed in
every iteration. However, in our experiments this had no noticeable effect.

^4 This is a conservative estimate. Indeed, if we make the (over)simplified assumption
that m marks are assigned randomly and independently to 2m end points, then only
one fourth of the edges survives in expectation. Interestingly, this is the amount of
reduction we observe in practice, even for the weighted case.
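The three passes of the sequential model can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the function name `local_max_matching`, the dictionary used for C, and the lexicographic tie-breaking key (weight, fresh random number, edge index) in place of an additive perturbation are all assumptions made here. Self-loops are assumed to be absent.

```python
import random

def local_max_matching(edges, weights=None, seed=0):
    """Sketch of sequential local max.

    edges   -- list of (u, v) pairs with u != v
    weights -- optional list of edge weights (unit weights if omitted)
    Returns (list of matched edge indices, number of iterations).
    """
    rng = random.Random(seed)
    if weights is None:
        weights = [1.0] * len(edges)
    alive = list(range(len(edges)))
    cand = {}            # C[v]; a missing key plays the role of the dummy edge
    matched_nodes = set()
    matching = []
    iterations = 0
    while alive:
        iterations += 1
        # Tie-breaking noise is renewed in every iteration (cf. footnote 3).
        key = {e: (weights[e], rng.random(), e) for e in alive}
        # Pass 1: every node remembers its heaviest incident edge.
        for e in alive:
            for x in edges[e]:
                if x not in cand or key[e] > key[cand[x]]:
                    cand[x] = e
        # Pass 2: an edge that is the candidate of both end points is matched.
        for e in alive:
            u, v = edges[e]
            if cand[u] == e and cand[v] == e:
                matching.append(e)
                matched_nodes.update((u, v))
        # Pass 3: drop edges at matched nodes, reset candidates of the rest.
        survivors = []
        for e in alive:
            u, v = edges[e]
            if u in matched_nodes or v in matched_nodes:
                continue
            cand.pop(u, None)
            cand.pop(v, None)
            survivors.append(e)
        alive = survivors
    return matching, iterations
```

Each iteration touches only the surviving edges, and the globally heaviest surviving edge is always locally maximal, so the loop terminates.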

CRCW PRAM Model. In the most powerful variant of the Combining 
CRCW PRAM that allows concurrent writes with a maximum reduction for 
resolving write conflicts, the sequential algorithm can be parallelized directly 
running in constant time per iteration using m processors. 

MapReduce Model. The CRCW PRAM result together with the simulation
result of Goodrich et al. [9] immediately implies that each iteration of local
max can be implemented in $O(\log_M n)$ rounds and $O(m \log_M n)$ communication
complexity in the MapReduce model, where M is the size of memory of each
compute node. Since typical compute nodes in MapReduce have at least $\Omega(m^\epsilon)$
memory [13], for some constant $\epsilon > 0$, each iteration of local max can be
performed in MapReduce in constant rounds and linear communication complexity.

External Memory Models. Using the PRAM emulation techniques for
algorithms with geometrically decreasing input size from [3, Theorem 3.2], the
above algorithm can be implemented in the external memory [T] and
cache-oblivious T] models in $O(\mathrm{sort}(m))$ I/O complexity, which seems to be optimal.

2.1 $O(\log^2 n)$ work-optimal CREW solution

In this section, we present an $O(\log^2 n)$ time CREW PRAM algorithm, which incurs
only $O(n + m)$ work.

Converting one representation of a graph into another can require as many as
$\Omega(m \log m)$ operations (e.g., the conversion from an unordered list of edges into
adjacency list representation requires sorting the edges). To perform matching
in $O(n + m)$ work, we define a graph representation suitable for our algorithm.

Consider an array V of n elements, where each entry $V[i] = \sum_{j<i} \deg(v_j)$
($\deg(v_j)$ denotes the degree of node $v_j$). An adjacency array representation of a
graph is an array A, where entries A[V[i]] through A[V[i + 1] - 1] are associated
with vertex $v_i \in G$ and store the edges incident on $v_i$.

We consider the following slightly altered adjacency array representation: the
edges are stored in a separate array E and the entries A[V[i]] through A[V[i + 1] - 1]
store the pointers to the corresponding edges in E (see Figure 1). Thus, we can
view each entry of A as a tuple $(v, e_k)$, where $v$ is a node in G and $e_k$ is a pointer
to a record E[k] with information about the edge incident on $v$, such as the two
vertices of the edge, edge weight, or any other auxiliary information.

Note that any edge $E[k] = \{v_i, v_j\}$ has two corresponding entries in A
pointing to it: $(v_i, e_k)$ and $(v_j, e_k)$. During our algorithm, a processor responsible
for $(v_i, e_k)$ might need to find and update entry $(v_j, e_k)$ (and vice versa). The
following lemma describes how to compute for each entry $(v_i, e_k)$ the address of
the corresponding entry $(v_j, e_k)$ in A.

V = (0, 2, 4, 8, 11, 12)

A = (0, 1, 1, 2, 0, 2, 3, 4, 3, 5, 6, 5, 4, 6)

E = ({v0, v2}, {v0, v1}, {v1, v2}, {v2, v3}, {v2, v5}, {v3, v4}, {v3, v5})

Fig. 1. Adjacency representation of a graph: Array E is a collection of edges. Entries
of A[V[i]] through A[V[i + 1] - 1] point to edges in E incident on vertex $v_i$.
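To make the representation concrete, the following sketch builds V and A from the edge array E of Figure 1 and reproduces the arrays shown there. It is a sequential illustration only: the helper name `build_adjacency_arrays` is hypothetical, and the entries of A hold just the edge pointer k, with the owning vertex implied by V rather than stored as a tuple $(v, e_k)$.

```python
def build_adjacency_arrays(n, E):
    """Return (V, A) for n vertices and an edge list E of (u, v) pairs."""
    deg = [0] * n
    for u, v in E:
        deg[u] += 1
        deg[v] += 1
    # V[i] = sum of deg(v_j) over all j < i (exclusive prefix sum of degrees).
    V = [0] * n
    for i in range(1, n):
        V[i] = V[i - 1] + deg[i - 1]
    # Fill each vertex segment of A with pointers (indices) into E.
    A = [None] * (2 * len(E))
    next_free = V[:]
    for k, (u, v) in enumerate(E):
        for x in (u, v):
            A[next_free[x]] = k
            next_free[x] += 1
    return V, A

# The graph of Figure 1.
E = [(0, 2), (0, 1), (1, 2), (2, 3), (2, 5), (3, 4), (3, 5)]
V, A = build_adjacency_arrays(6, E)
print(V)
print(A)
```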

Lemma 2. For every edge $e_k = \{v_i, v_j\} \in E$, entries $(v_i, e_k)$ and $(v_j, e_k)$ of A
can compute each other's index in A in $O(1)$ time and $O(|A|)$ work in the CREW
PRAM model.

Proof. For every $E[k] = \{v_i, v_j\}$ we show how $(v_j, e_k)$ can compute the address
of the corresponding entry $(v_i, e_k)$ in A for $i < j$. The addresses for the other
half of the entries are computed symmetrically.

The algorithm proceeds in two phases. In the first phase, each entry $(v_i, e_k)$
writes its own address into $E[k] = \{v_i, v_j\}$ iff $i < j$. In the second phase,
each entry $(v_j, e_k)$ reads the address of $(v_i, e_k)$ from $E[k] = \{v_i, v_j\}$ iff $j > i$.

If we assign a separate processor to each entry of A, each processor performs
only $O(1)$ steps. Moreover, there are no concurrent writes because at each step
only one of the two vertices of the edge writes to E[k]. Note that we need a
concurrent read of $E[k] = \{v_i, v_j\}$ to determine the relative order of i and j for
$v_i$ and $v_j$. ■
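The two phases of the proof can be simulated sequentially as sketched below. Each `for` loop stands for one parallel step in which every entry of A has its own processor. The names `partner_indices`, `owner` and `slot` are introduced here for illustration (`slot[k]` models the address field stored with E[k]), and the graph is assumed to have no self-loops.

```python
def partner_indices(V, A, E):
    """For each index a of A, return the index of the other entry pointing to E[A[a]]."""
    n = len(V)
    # owner[a] is the vertex whose segment of A contains index a.
    owner = [0] * len(A)
    for i in range(n):
        end = V[i + 1] if i + 1 < n else len(A)
        for a in range(V[i], end):
            owner[a] = i
    slot = [None] * len(E)       # one address field per edge record
    partner = [None] * len(A)
    # First the smaller end point writes and the larger one reads,
    # then the roles are swapped for the symmetric half.
    for writer, reader in ((min, max), (max, min)):
        for a, k in enumerate(A):            # write phase: one writer per edge
            if owner[a] == writer(E[k]):
                slot[k] = a
        for a, k in enumerate(A):            # read phase
            if owner[a] == reader(E[k]):
                partner[a] = slot[k]
    return partner

# Arrays of Figure 1.
V = [0, 2, 4, 8, 11, 12]
A = [0, 1, 1, 2, 0, 2, 3, 4, 3, 5, 6, 5, 4, 6]
E = [(0, 2), (0, 1), (1, 2), (2, 3), (2, 5), (3, 4), (3, 5)]
partner = partner_indices(V, A, E)
```

Because exactly one end point of an edge satisfies the writer condition in each write phase, no two simulated processors write to the same `slot[k]`.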

Lemma 3. Using our graph representation, each node v in the graph can apply
an associative operator $\oplus$ to all edges incident on v in $O(\log |A|)$ time and $O(|A|)$
work on the CREW PRAM model.

Proof. First, we read for each entry $(v, e_k) \in A$ the value from E[k] on which
to apply the operator. Next, we run segmented prefix sums with operator $\oplus$ on
these values, where segments are the portions of A representing the neighbors of
a single node. Finally, each entry $(v, e_k) \in A$ applies its result of segmented
prefix sums to the edge E[k], while using the technique of Lemma 2 to avoid write
conflicts. Each step of the algorithm can be implemented in $O(\log |A|)$ time using
$O(|A|)$ work. ■

Now we are ready to describe the solution to the matching problem. We
perform the following in each phase of the local max algorithm.

1. Each edge $e_k \in E$ picks a random weight $w_k$.

2. Using Lemma 3, each vertex v identifies the heaviest edge incident on v
by applying the associative operator MAX to the edge weights picked in the
previous step.

3. Using Lemma 2, each entry $(v_i, e_k)$ checks if $E[k] = \{v_i, v_j\}$ is also the
heaviest incident edge on $v_j$. If so and $i < j$, $v_i$ adds $e_k$ to the matching and sets
the deletion flag $f = 1$ on E[k].

4. Using Lemma 3, each entry $(v_i, e_k)$ spreads the deletion flag over all edges
incident on $v_i$ by applying the MAX associative operator to the deletion flags of
the edges incident on $v_i$. Thus, if at least one edge incident on $v_i$ was added to
the matching, all edges incident on $v_i$ will be marked for deletion.

5. Now we must prepare the graph representation for the next phase by
removing all entries of E and A marked for deletion, compacting E and A and
updating the pointers of A to point to the compacted entries of E. To perform
the compaction, we compute for each entry E[k] how many entries E[i] with
$i < k$ must be deleted, and likewise for the entries of A. This can be accomplished
using parallel prefix sums on the deletion flags of each entry in E and A. Let the
result of prefix sums for edge E[k] be $d_k$ and for entry A[i] be $r_i$. Then $k - d_k$
is the new address of the entry E[k] and $i - r_i$ is the new address of A[i] once all
edges marked for deletion are removed.

6. Each entry E[k] that is not marked for deletion copies itself to $E[k - d_k]$.
The corresponding entry $(v, e_k) \in A$ updates itself to point to the new entry
$E[k - d_k]$, i.e., $(v, e_k)$ becomes $(v, e_{k - d_k})$, and copies itself to $A[i - r_i]$.

The algorithm defines a single phase of the local max algorithm. Each step
of the phase takes at most $O(\log(m + n)) = O(\log n)$ time and $O(n + m)$ work
in the CREW PRAM model. Over $O(\log m)$ phases, each with a geometrically
decreasing number of edges, the local max takes $O(\log^2 n)$ time and $O(n + m)$
work in the CREW PRAM model.
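Steps 1 to 6 of a phase can be simulated sequentially on the arrays V, A and E as sketched below. This is only an illustration of the data flow, not a PRAM program: plain Python loops and `itertools.accumulate` stand in for the segmented and ordinary parallel prefix sums, step 3 compares the per-vertex maxima directly instead of going through partner entries, and updating V during compaction is an assumption added here (the steps above mention only E and A). All function names are hypothetical, and the sketch assumes n >= 1 and no self-loops.

```python
import random
from itertools import accumulate

def build(n, E):
    deg = [0] * n
    for u, v in E:
        deg[u] += 1
        deg[v] += 1
    V = [0] + list(accumulate(deg))[:-1]
    nxt, A = V[:], [0] * (2 * len(E))
    for k, (u, v) in enumerate(E):
        for x in (u, v):
            A[nxt[x]] = k
            nxt[x] += 1
    return V, A

def segmented_max(values, V):
    """Maximum over each vertex segment of A (None for isolated vertices)."""
    bounds = V + [len(values)]
    return [max(values[bounds[i]:bounds[i + 1]], default=None)
            for i in range(len(V))]

def local_max_phase(V, A, E, rng):
    m = len(E)
    w = [(rng.random(), k) for k in range(m)]               # step 1
    best = segmented_max([w[k] for k in A], V)              # step 2
    flag = [int(best[u] == w[k] == best[v])                 # step 3
            for k, (u, v) in enumerate(E)]
    hit = segmented_max([flag[k] for k in A], V)            # step 4
    dead = [int(bool(hit[u] or hit[v])) for u, v in E]
    # Step 5: d[k] (r[a]) counts deleted entries of E (A) before index k (a).
    d = [0] + list(accumulate(dead))
    r = [0] + list(accumulate(dead[k] for k in A))
    # Step 6: surviving entries copy themselves to their new addresses.
    new_E = [None] * (m - d[m])
    for k in range(m):
        if not dead[k]:
            new_E[k - d[k]] = E[k]
    new_A = [None] * (len(A) - r[len(A)])
    for a, k in enumerate(A):
        if not dead[k]:
            new_A[a - r[a]] = k - d[k]
    new_V = [s - r[s] for s in V]      # segment starts shift by deleted entries
    matched = [E[k] for k in range(m) if flag[k]]
    return matched, new_V, new_A, new_E

def phased_local_max(n, E, seed=0):
    rng = random.Random(seed)
    V, A = build(n, E)
    M, phases = [], 0
    while E:
        matched, V, A, E = local_max_phase(V, A, E, rng)
        M += matched
        phases += 1
    return M, phases
```

For instance, `phased_local_max(6, E)` with the edge list of Figure 1 returns a maximal matching of that graph together with the number of phases used.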

3 Implementations and Experiments 

We now report experiments focusing on computing approximate maximum weight
matchings. We consider the following families of inputs, where the first two
classes allow comparison with the experiments from [17].

Delaunay Instances are created by randomly choosing $n = 2^x$ points in the
unit square and computing their Delaunay triangulation. Edge weights are
Euclidean distances.

Random graphs with $n := 2^x$ nodes, $\alpha n$ edges for $\alpha \in \{4, 16, 64\}$, and random
edge weights chosen uniformly from [0, 1].

Random geometric graphs with $2^x$ nodes (rggx). Each vertex is a random
point in the unit square and edges connect vertices whose Euclidean distance is
below $0.55 \sqrt{\ln n / n}$. This threshold was chosen in order to ensure that the graph
is almost connected.

Florida Sparse Matrix. Following [6] we use 126 symmetric non-0/1 matrices
from [3], using absolute values of their entries as edge weights. The number of
edges of the resulting graphs is $m \in (0.5 \ldots 16) \times 10^6$.
See Appendix B for the full list.

Graph Contraction. We use the graphs considered by KaFFPa for partitioning
graphs from the 10th DIMACS Implementation Challenge.

We compare implementations of local max, the red-blue algorithm from [6]
(RBM) (their implementation), heavy edge matching (HEM) [8], greedy, and the
global path algorithm (GPA) [17]. HEM iterates through the nodes (optionally
in random order) and matches the heaviest incident edge that is nonadjacent to
a previously matched edge. The greedy algorithm sorts the edges by decreasing
weights, scans them and inserts edges connecting unmatched nodes into the
matching. GPA refines greedy. It greedily inserts edges into a graph G2 with
maximum degree two and no odd cycles. Using dynamic programming on the
resulting paths and even cycles, a maximum weight matching of G2 is computed.
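For orientation, the two simplest baselines described above can be sketched as follows. These are illustrative reimplementations written for this text, not the code used in the experiments; the function names and the optional `rng` argument for randomizing the node order are assumptions. GPA is omitted because its dynamic programming step does not fit a short sketch.

```python
def greedy_matching(edges, weights):
    """Scan edges by decreasing weight and keep those joining two unmatched nodes."""
    order = sorted(range(len(edges)), key=lambda e: weights[e], reverse=True)
    matched, M = set(), []
    for e in order:
        u, v = edges[e]
        if u not in matched and v not in matched:
            M.append(e)
            matched.update((u, v))
    return M

def heavy_edge_matching(n, edges, weights, rng=None):
    """Visit nodes in order and match each free node along its heaviest free edge."""
    adj = [[] for _ in range(n)]
    for e, (u, v) in enumerate(edges):
        adj[u].append(e)
        adj[v].append(e)
    nodes = list(range(n))
    if rng is not None:          # optional random node order
        rng.shuffle(nodes)
    matched, M = set(), []
    for x in nodes:
        if x in matched:
            continue
        free = [e for e in adj[x]
                if edges[e][0] not in matched and edges[e][1] not in matched]
        if free:
            e = max(free, key=lambda f: weights[f])
            M.append(e)
            matched.update(edges[e])
    return M

# A path 0-1-2-3 whose middle edge is heavy.
path = [(0, 1), (1, 2), (2, 3)]
w = [1, 5, 1]
print(greedy_matching(path, w))
print(heavy_edge_matching(4, path, w))
```

On this path greedy picks only the heavy middle edge (weight 5), whereas HEM visiting node 0 first commits to the two light edges (weight 2), which illustrates why the node order can hurt HEM's quality.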

Sequential and shared-memory parallel experiments were performed on an
Intel i7 920 2.67 GHz quad-core machine with 6 GB of memory. We used a
commodity NVidia Fermi GTX 480 featuring 15 multiprocessors, each containing 
32 scalar processors, for a total of 480 CUDA cores on chip. The GPU RAM is 
1.5 GB. We compiled all implementations using CUDA 4.2 and Microsoft Visual 
Studio 2010 on 64-bit Windows 7 Enterprise with maximum optimization level. 



3.1 Sequential Speed and Quality 

We compare solution quality of the algorithms relative to GPA. Via the
experiments in [17] this also allows some comparison with optimal solutions, which
are only a few percent better there. Figure 2 shows the quality for Delaunay
graphs (where GPA is about 5 % from optimal [17]). We see that local max
achieves almost the same quality as greedy, which is only about 2 % worse than
GPA. HEM, possibly the fastest nontrivial sequential algorithm, is about 13 %
away while RBM is 14 % worse than GPA, i.e., HEM and RBM almost double
the gap to optimality of local max. Looking at the running times, we see that
HEM is the fastest (with a surprisingly large cost for actually randomizing node
orders) followed by local max, greedy, GPA, and RBM. From this it looks like
HEM, local max, and GPA are the winners in the sense that none of them is
dominated by another algorithm with respect to both quality and running time.
Greedy has similar quality to local max but takes somewhat longer and is not so
easy to parallelize. RBM as a sequential algorithm is dominated by all other
algorithms. Perhaps the most surprising thing is that RBM is fairly slow. This has
to be taken into account when evaluating reported speedups. We suspect that a
more efficient implementation is possible but do not expect that this changes the
overall conclusion. In Appendix A we report similar results for the rgg instances
(Figure 6) and random graphs (Figures 7, 8, 9).

Fig. 2. Ratio of the weights computed by GPA and other algorithms for Delaunay
instances and running times.

Fig. 3. Ratio of the weights computed by GPA and other sequential algorithms for
sparse matrix instances and running time.

Looking at the wide range of instances in the Florida Sparse Matrix collection
leads to similar but more complicated conclusions. Figure 3 shows the solution
qualities for greedy, local max, RBM and HEM relative to GPA. RBM and even
more so HEM show erratic behavior with respect to solution quality. Greedy
and local max are again very close to GPA and even closer to each other, although
there is a sizable minority of instances where greedy is somewhat better than
local max. Looking at the corresponding running times one gets a surprisingly
diverse picture. HEM, which is again fastest, and RBM, which is again dominated
by local max, are not shown. There are instances where local max is considerably
faster than greedy and vice versa. A possible explanation is that greedy becomes
quite fast when there is only a small number of different edge weights since then
sorting is a quite easy problem.

Experiments on the graph contraction instances in [5] show local max about 
1 % away from GPA. For these instances the average fraction of remaining edges 
after an iteration is well below 25 %. Notable exceptions are the graphs add20 
and memplus which both represent VLSI circuits. Nevertheless, none of the 
instances considered required more than 10 iterations. 

3.2 Distributed Memory Implementation 

Our distributed memory parallelization (using MPI) on p processing elements
(PEs or MPI processes) assigns nodes to PEs and stores all edges incident to
a node locally. This can be done in a load balanced way if no node has degree
exceeding m/p. The second pass of the basic algorithm from Section 2 has to
exchange information on candidate edges that cross a PE boundary. In the worst
case, this can involve all edges handled by a PE, i.e., we can expect better
performance if we manage to keep most edges local. In our experiments, one
PE owns nodes whose numbers are a consecutive range of the input numbers.
Thus, depending on how much locality the input numbering contains, we have
a highly local or a highly non-local situation. We have not considered more
sophisticated ways of node assignment so far since our motivating application
is graph partitioning/clustering where almost by definition we initially do not
know which nodes form clusters: this is the intended output. Since Lemma 1
also applies to the subgraph relevant for a particular PE, we can expect that the
graph shrinks fairly uniformly over the entire network.

Fig. 4. Scaling results of the parallel local max algorithm on random geometric graphs
with random edge weights. Left: rgg23 (≈ 63 million edges). Right: rgg24 (≈ 132 million
edges). (Horizontal axis of both panels: number of processes p, from 2 to 256.)

We performed experiments on two different clusters at the KIT computing
center, both using compute-nodes with two quad-core processors each. Refer to
[2] for details. We ran experiments with up to 128 compute-nodes corresponding to
1024 cores with one MPI process per core.

Figure 4 illustrates how our distributed local max implementation scales for
the random geometric graphs rgg23 and rgg24 (using random edge weights)
which have fairly good locality. We plot the decrease in running time for
successive doubling of p, i.e., a value of two stands for perfect relative speedup for
this step and a value below one means that parallelization no longer helps. We
see values slightly below two for the steps 1 → 2 and 2 → 4 which is typical
behavior of multicore algorithms when cores compete for resources like memory 
bandwidth. For p = 8 we start to use two compute-nodes (with 4 active cores 
each) and consequently we see the largest dip in efficiency. Beyond that, we have 
almost perfect scaling until the problem instance becomes too small. We have 
similar behavior for other graphs with good locality. For graphs with poor lo- 
cality, efficiency is not very good. However the ratios stay above one for a very 
long time, i.e., it pays to use parallelism when it is available anyway. This is the 
situation we have when partitioning large graphs for use on massively parallel 
machines. Considering that the matching step in graph partitioning is often the 
least work intensive one in multi-level graph partitioning algorithms we conclude 
that local max might be a way to remove a sequential bottleneck from massively 
parallel graph partitioning. Refer to J? for additional data. 

3.3 GPU Implementation 

Our GPU algorithm is a fairly direct implementation of the CRCW algorithm. 
We reduce the algorithm to basic primitives such as segmented prefix sum,
prefix sum and random gather/scatter from/to GPU memory. As a basis for our
implementation we use the back40computing library by Merrill [TH].

Figure 5 compares the running time of our implementation with GPA,
sequential local max, the RBM algorithm parallelized for 4 cores, and its GPU
parallelization from [6]. While the CPU implementation has trouble recovering
from its sequential inefficiency and is only slightly faster than even sequential
local max, the GPU implementation is impressively fast, in particular for small
graphs. For large graphs, the GPU implementation of local max is faster. Since
local max has better solution quality, we consider this a good result. Our GPU
code is up to 35 times faster than sequential local max. We may also be able
to learn from the implementation techniques of RBM GPU for small inputs in
future work.

Fig. 5. Running time of sequential and GPU algorithms for Delaunay instances.

For random geometric graphs and random graphs, we get similar behavior
(see Figures 10 and 11 in Appendix A). The results for rgg are slightly worse for
GPU local max: speedup is up to 24 over sequential local max, with a speed
advantage over GPU RBM only for the very largest inputs. As for random graphs,
the denser the graph is, the larger is our speedup over the sequential and GPU
RBM implementations. Thus, for $\alpha = 64$ our implementation is already faster than
GPU RBM for n = [value illegible], while for n = 2^[exponent illegible] it is 65% faster
than GPU RBM and 30 times faster than sequential local max.



4 Conclusions 



The local max algorithm is a good choice for parallel or external computation 
of maximal and approximate maximum weight matchings. On the theoretical 
side it is provably efficient for computing maximal matchings and guarantees a 
1/2-approximation. On the practical side it yields better quality at faster speed 
than several competitors including the greedy algorithm and RBM. Somewhat 
surprisingly it is even attractive as a sequential algorithm, outperforming HEM 
with respect to solution quality and other algorithms with respect to speed. 

Many interesting questions remain. Can we omit re-randomization of edge
weights when computing maximal matchings? Is there a linear work parallel
algorithm with polylogarithmic execution time that computes 1/2-approximations
(or any other constant factor approximation)? Can we even do 2/3-approximations
with linear work in parallel [5120] ? 
A More Experimental results

Fig. 6. Ratio of the weights computed by GPA and other algorithms for random
geometric graphs and running time.

Fig. 7. Ratio of the weights computed by GPA and other sequential algorithms (left)
and their timing (right) for random graphs with α = 4.

Fig. 8. Ratio of the weights computed by GPA and other sequential algorithms (left)
and their timing (right) for random graphs with α = 16.

Fig. 9. Ratio of the weights computed by GPA and other sequential algorithms (left)
and their timing (right) for random graphs with α = 64.
B List of Instances

file                                  n         m
2cubes_sphere.mtx.graph          101492    772886
af_0_k101.mtx.graph              503625   8523525
af_1_k101.mtx.graph              503625   8523525
af_2_k101.mtx.graph              503625   8523525
af_3_k101.mtx.graph              503625   8523525
af_4_k101.mtx.graph              503625   8523525
af_5_k101.mtx.graph              503625   8523525
af_shell1.mtx.graph              504855   8542010
af_shell2.mtx.graph              504855   8542010
af_shell3.mtx.graph              504855   8542010
af_shell4.mtx.graph              504855   8542010
af_shell5.mtx.graph              504855   8542010
af_shell6.mtx.graph              504855   8542010
af_shell7.mtx.graph              504855   8542010
af_shell8.mtx.graph              504855   8542010
af_shell9.mtx.graph              504855   8542010
apache2.mtx.graph                715176   2051347
BenElechi1.mtx.graph             245874   6452311
bmw3_2.mtx.graph                 227362   5530634
bmw7st_1.mtx.graph               141347   3599160
bmwcra_1.mtx.graph               148770   5247616
boneS01.mtx.graph                127224   3293964
boyd1.mtx.graph                   93279    558985
c-73.mtx.graph                   169422    554926
c-73b.mtx.graph                  169422    554926
c-big.mtx.graph                  345241    997885
cant.mtx.graph                    62451   1972466
case39.mtx.graph                  40216    516021
case39_A_01.mtx.graph             40216    516021
case39_A_02.mtx.graph             40216    516026
case39_A_03.mtx.graph             40216    516026
case39_A_04.mtx.graph             40216    516026
case39_A_05.mtx.graph             40216    516026
case39_A_06.mtx.graph             40216    516026
case39_A_07.mtx.graph             40216    516026
case39_A_08.mtx.graph             40216    516026
case39_A_09.mtx.graph             40216    516026
case39_A_10.mtx.graph             40216    516026
case39_A_11.mtx.graph             40216    516026
case39_A_12.mtx.graph             40216    516026
case39_A_13.mtx.graph             40216    516026
cfd1.mtx.graph                    70656    878854
cfd2.mtx.graph                   123440   1482229
CO.mtx.graph                     221119   3722469
consph.mtx.graph                  83334   2963573
cop20k_A.mtx.graph                99843   126221a
crankseg_1.mtx.graph              52804   5280703
crankseg_2.mtx.graph              63838   7042510
ct20stif.mtx.graph                52329   1323067
darcy003.mtx.graph               389874    933557
dawson5.mtx.graph                 51537    479620
denormal.mtx.graph                89400    533412
dielFilterV2clx.mtx.graph        607232  12351020
dielFilterV3clx.mtx.graph        420408  16232900
Dubcova2.mtx.graph                65025    482600
Dubcova3.mtx.graph               146689   1744980
d_pretok.mtx.graph               182730    756256
ecology1.mtx.graph              1000000   1998000
ecology2.mtx.graph               999999   1997996
engine.mtx.graph                 143571   2281251
F1.mtx.graph                     343791  13246661
F2.mtx.graph                      71505   2611390
Fault_639.mtx.graph              638802  13987881
filter3D.mtx.graph               106437   1300371
G3_circuit.mtx.graph            1585478   3037674
Ga10As10H30.mtx.graph            113081   3001276
Ga19As19H42.mtx.graph            133123   4375858
Ga3As3H12.mtx.graph               61349   2954799
Ga41As41H72.mtx.graph            268096   9110190
GaAsH6.mtx.graph                  61349   1660230
gas_sensor.mtx.graph              66917    818224
Ge87H76.mtx.graph                112985   3889605
Ge99H100.mtx.graph               112985   4169205
gsm_106857.mtx.graph             589446  10584739
H2O.mtx.graph                     67024   1074856
helm2d03.mtx.graph               392257   1174839
hood.mtx.graph                   220542   5273947
IG5-17.mtx.graph                  30162   1034600
invextr1_new.mtx.graph            30412    906915
kkt_power.mtx.graph             2063494   6482320
Lin.mtx.graph                    256000    755200
mario002.mtx.graph               389874    933557
mixtank_new.mtx.graph             29957    982542
mouse_gene.mtx.graph              45101  14461095
msdoor.mtx.graph                 415863   9912536
m_t1.mtx.graph                    97578   4827996
nasasrb.mtx.graph                 54870   1311227
nd12k.mtx.graph                   36000   7092473
nd24k.mtx.graph                   72000  14321817
nlpkkt80.mtx.graph              1062400  13821136
offshore.mtx.graph               259789   1991442
oilpan.mtx.graph                  73752   1761718
parabolic_fem.mtx.graph          525825   1574400
pdb1HYS.mtx.graph                 36417   2154174
pwtk.mtx.graph                   217918   5708253
qa8fk.mtx.graph                   66127    797226
qa8fm.mtx.graph                   66127    797226
s3dkq4m2.mtx.graph                90449   2365221
s3dkt3m2.mtx.graph                90449   1831506
shipsec1.mtx.graph               140874   3836265
shipsec5.mtx.graph               179860   4966618
shipsec8.mtx.graph               114919   3269240
ship_001.mtx.graph                34920   2304655
ship_003.mtx.graph               121728   3982153
Si34H36.mtx.graph                 97569   2529405
Si41Ge41H72.mtx.graph            185639   7412813
Si87H76.mtx.graph                240369   5210631
SiO.mtx.graph                     33401    642127
SiO2.mtx.graph                   155331   5564086
sparsine.mtx.graph                50000    749494
StocF-1465.mtx.graph            1465137   9770126
t3dh.mtx.graph                    79171   2136467
t3dh_a.mtx.graph                  79171   2136467
thermal2.mtx.graph              1228045   3676134
thread.mtx.graph                  29736   2220156
tmt_sym.mtx.graph                726713 (missing)
i DUFr _r o_DiDZ_Co.mtx.graph (missing)    oyD6oo
TSOPF_FS_b162_c4.mtx.graph        40798   1193898
TSOPF_FS_b300.mtx.graph           29214   2196173
TSOPF_FS_b300_c1.mtx.graph        29214   2196173
TSOPF_FS_b300_c2.mtx.graph        56814   4376395
TSOPF_FS_b39_c19.mtx.graph        76216    979241
TSOPF_FS_b39_c30.mtx.graph       120216   1545521
turon_m.mtx.graph                189924    778531
vanbody.mtx.graph                 47072   1144913
x104.mtx.graph                   108384   5029620